
UPSTREAM PR #17116: rpc : fix alloc size logic#262

Open
DajanaV wants to merge 2 commits into main from upstream-PR17116-branch_ggml-org-gg/rpc-fix-alloc-size

Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17116

fix #16657
ref ggml-org/llama.cpp#16276 (review)

This fixes the RPC inference when Metal backend is involved.

Testing:

```shell
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on
```

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls?

loci-review bot commented Nov 18, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: RPC Allocation Size Logic Fix

Overview

PR #262 implements a fix for RPC inference when Metal backend is involved, addressing allocation size calculation logic in the RPC system. The changes are contained within the GGML RPC subsystem (ggml-rpc.h and ggml-rpc.cpp) and do not modify core inference functions.

Analysis Results

Performance Metrics: No performance data was available for the specified version comparison, indicating either incomplete analysis pipeline execution or that the changes are too localized to generate measurable performance differences in the core inference path.

Code Changes Scope: The modifications are limited to:

  • RPC protocol version bump (breaking change requiring client/server sync)
  • Enhanced allocation size request structure to include source tensors
  • Null pointer safety improvements in tensor serialization
  • Expanded allocation logic for specific operations (GGML_OP_FLASH_ATTN_EXT, GGML_OP_MUL_MAT_ID)

Core Function Impact: The changes do not affect primary inference functions (llama_decode, llama_encode, llama_tokenize) or other performance-critical components identified in the project structure. The modifications are isolated to RPC backend allocation logic.

Network and Memory Impact: The fix introduces additional RPC message overhead by serializing source tensors (GGML_MAX_SRC * sizeof(rpc_tensor) per allocation request) and increases server-side memory allocation. However, this overhead only affects distributed inference scenarios using RPC backends.

Correctness Benefits: The implementation addresses a fundamental issue where allocation size calculations were insufficient for certain tensor operations, particularly affecting Metal backend compatibility. The fix prevents potential allocation failures that could cause crashes or incorrect results in distributed inference setups.

Binary Impact: Changes affect RPC-enabled binaries (llama-cli, rpc-server) when used with distributed inference configurations. Standard local inference remains unaffected.

The changes represent a targeted correctness fix with minimal performance impact on typical usage patterns. The modifications improve system reliability for distributed inference scenarios while maintaining compatibility with existing local inference workflows.

@loci-dev loci-dev force-pushed the main branch 28 times, most recently from ab559ce to e612b7c Compare November 24, 2025 22:10
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 9239ee7 to 96dc574 Compare November 28, 2025 16:10
@loci-dev loci-dev force-pushed the upstream-PR17116-branch_ggml-org-gg/rpc-fix-alloc-size branch from 590a805 to 4953693 Compare November 28, 2025 17:35
loci-review bot commented Nov 28, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp (auroralabs-loci)
PR #262: RPC allocation size logic fix for Metal backend compatibility
Versions Compared: aa09cdea (base) vs 135d56f6 (target)


Analysis Result

No performance changes detected between the baseline and target versions. All 16 binaries show 0.0% change in power consumption. Function-level metrics indicate no measurable differences in response time or throughput across the codebase.

Code Changes: The PR modifies RPC protocol message structures in ggml-rpc.cpp and ggml-rpc.h to include source tensor serialization for allocation size queries. These changes affect RPC communication logic but do not alter compiled binary behavior for the analyzed versions, suggesting either that the modifications are not active in the current build configuration or that the analysis captured identical build artifacts.

Inference Impact: No impact on tokens per second. Core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes.

@loci-dev loci-dev force-pushed the main branch 16 times, most recently from 9368c2d to 50d76f4 Compare December 1, 2025 09:13


Development

Successfully merging this pull request may close these issues.

3 participants